Slovene-English Datasets for MT
Abstract
Advances in machine translation increasingly depend on the availability of large-scale language resources, in particular parallel corpora. The talk presents Slovene-English language resources that were developed as datasets for translation studies and machine learning programs. Three parallel datasets are introduced: the MULTEXT-East multilingual word-annotated corpus, the IJS-ELAN Slovene-English parallel corpus, and the Concede English-Slovene dictionary fragment and lexical database.
Similar resources
Quality Estimation for Synthetic Parallel Data Generation
This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English–Croatian version of the Europarl parallel corpus based on the English–Slovene Europarl corpus and the Apertium rule-based translation system for Slovene–Croatian. ...
Building Language Resources and Translation Models for Machine Translation Focused on South Slavic and Balkan Languages
The aim of this short-term project was to investigate the feasibility of machine translation (MT) research and development for several South Slavic and Balkan languages, more precisely Romanian, Bulgarian, Slovene, Greek and Serbian. For these languages, MT systems are scarce and for some of them even non-existent. We provide a brief description of the project’s major research tasks: Compilatio...
Were the clocks striking or surprising? Using WSD to improve MT performance
We report on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using wordnet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results to three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambiguation system. Results are evaluated in three ...
Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool fo...
Japanese to English/Chinese/Korean Datasets for Translation Quality Estimation and Automatic Post-Editing
Aiming at facilitating the research on quality estimation (QE) and automatic post-editing (APE) of machine translation (MT) outputs, especially for those among Asian languages, we have created new datasets for Japanese to English, Chinese, and Korean translations. As the source text, actual utterances in Japanese were extracted from the log data of our speech translation service. MT outputs wer...